Collecting the Data
Import pandas and numpy
import pandas as pd
import numpy as np
Let’s take a look at the files in our input directory, using the convenient shell commands in IPython.
!ls data
311-service-requests.csv
All-Web-Site-Data-Audience-Overview.xlsx
Aussie_Wines_Plotting.csv
March-2017-forecast-article.xlsx
Traffic_20170306-20170519.xlsx
Tweets.csv
airbnb_session_data.txt
customer-status.xlsx
debtors.csv
excel_sample.xlsx
general-ledger-sample.xlsx
mn-budget-detail-2014.csv
pnl.xlsx
population.xlsx
sales-estimate.xlsx
sales-feb-2014.xlsx
sales-jan-2014.xlsx
sales-mar-2014.xlsx
sales_data_types.csv
sales_transactions.xlsx
salesfunnel.xlsx
sample-sales-reps.xlsx
sample-sales-tax.csv
sample-salesv3.xlsx
tb_import.xlsx
~$general-ledger-sample.xlsx
There are a lot of files, but we only want to look at the sales .xlsx files.
!ls data/sales-*-2014.xlsx
data/sales-feb-2014.xlsx data/sales-jan-2014.xlsx data/sales-mar-2014.xlsx
Use the Python glob module to easily list out the files we need.
import glob
glob.glob("data/sales-*-2014.xlsx")
['data/sales-feb-2014.xlsx',
'data/sales-jan-2014.xlsx',
'data/sales-mar-2014.xlsx']
This gives us what we need. Let’s import each of our files and combine them into one DataFrame.
pandas’ concat and append can do this for us. I’m going to use append in this example.
The code snippet below initializes an empty DataFrame, then appends each of the individual files into the all_data DataFrame.
all_data = pd.DataFrame()
for f in glob.glob("data/sales-*-2014.xlsx"):
    df = pd.read_excel(f)
    all_data = all_data.append(df, ignore_index=True)
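A quick future-proofing note: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so pd.concat is the safer long-term choice. A minimal sketch of the same import using concat:

# Read each matching file into a list, then concatenate once.
# This avoids growing a DataFrame inside the loop and works on
# modern pandas, where append no longer exists.
frames = [pd.read_excel(f) for f in glob.glob("data/sales-*-2014.xlsx")]
all_data = pd.concat(frames, ignore_index=True)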
Now we have all the data in our all_data DataFrame. You can use describe to look at it and make sure your data looks good.
all_data.shape
(384, 7)
all_data.describe()
| | account number | quantity | unit price | ext price |
---|---|---|---|---|
count | 384.000000 | 384.000000 | 384.000000 | 384.000000 |
mean | 478125.989583 | 24.372396 | 56.651406 | 1394.517344 |
std | 220902.947401 | 14.373219 | 27.075883 | 1117.809743 |
min | 141962.000000 | -1.000000 | 10.210000 | -97.160000 |
25% | 257198.000000 | 12.000000 | 32.612500 | 482.745000 |
50% | 424914.000000 | 23.500000 | 58.160000 | 1098.710000 |
75% | 714466.000000 | 37.000000 | 80.965000 | 2132.260000 |
max | 786968.000000 | 49.000000 | 99.730000 | 4590.810000 |
A lot of these statistics may not mean much for this data set, but I’m most interested in the count row, to make sure the number of data elements makes sense.
all_data.head()
| | account number | name | sku | quantity | unit price | ext price | date |
---|---|---|---|---|---|---|---|
0 | 383080 | Will LLC | B1-20000 | 7 | 33.69 | 235.83 | 2014-02-01 09:04:59 |
1 | 412290 | Jerde-Hilpert | S1-27722 | 11 | 21.12 | 232.32 | 2014-02-01 11:51:46 |
2 | 412290 | Jerde-Hilpert | B1-86481 | 3 | 35.99 | 107.97 | 2014-02-01 17:24:32 |
3 | 412290 | Jerde-Hilpert | B1-20000 | 23 | 78.90 | 1814.70 | 2014-02-01 19:56:48 |
4 | 672390 | Kuhn-Gusikowski | S1-06532 | 48 | 55.82 | 2679.36 | 2014-02-02 03:45:20 |
It is not critical in this example, but the best practice is to convert the date column to a datetime object.
all_data['date'] = pd.to_datetime(all_data['date'])
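Alternatively, read_excel can parse the column while loading each file, assuming the column is named date in the spreadsheets:

df = pd.read_excel(f, parse_dates=["date"])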
Combining Data
Now that we have all of the data in one DataFrame, we can do any manipulations the DataFrame supports. In this case, the next thing we want to do is read in another file that contains the customer status by account. You can think of this as a company’s customer segmentation strategy or some other mechanism for identifying its customers.
First, we read in the data.
status = pd.read_excel("data/customer-status.xlsx")
status
| | account number | name | status |
---|---|---|---|
0 | 740150 | Barton LLC | gold |
1 | 714466 | Trantow-Barrows | silver |
2 | 218895 | Kulas Inc | bronze |
3 | 307599 | Kassulke, Ondricka and Metz | bronze |
4 | 412290 | Jerde-Hilpert | bronze |
5 | 729833 | Koepp Ltd | silver |
6 | 146832 | Kiehn-Spinka | silver |
7 | 688981 | Keeling LLC | silver |
8 | 786968 | Frami, Hills and Schmidt | silver |
9 | 239344 | Stokes LLC | gold |
10 | 672390 | Kuhn-Gusikowski | silver |
11 | 141962 | Herman LLC | gold |
12 | 424914 | White-Trantow | silver |
13 | 527099 | Sanford and Sons | bronze |
14 | 642753 | Pollich LLC | bronze |
15 | 257198 | Cronin, Oberbrunner and Spencer | gold |
We want to merge this data with our concatenated data set of sales. We use pandas’ merge function and tell it to do a left join, which is similar to Excel’s VLOOKUP function.
all_data_st = pd.merge(all_data, status, how='left')
all_data_st.head()
| | account number | name | sku | quantity | unit price | ext price | date | status |
---|---|---|---|---|---|---|---|---|
0 | 383080 | Will LLC | B1-20000 | 7 | 33.69 | 235.83 | 2014-02-01 09:04:59 | NaN |
1 | 412290 | Jerde-Hilpert | S1-27722 | 11 | 21.12 | 232.32 | 2014-02-01 11:51:46 | bronze |
2 | 412290 | Jerde-Hilpert | B1-86481 | 3 | 35.99 | 107.97 | 2014-02-01 17:24:32 | bronze |
3 | 412290 | Jerde-Hilpert | B1-20000 | 23 | 78.90 | 1814.70 | 2014-02-01 19:56:48 | bronze |
4 | 672390 | Kuhn-Gusikowski | S1-06532 | 48 | 55.82 | 2679.36 | 2014-02-02 03:45:20 | silver |
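Because we did not pass any join keys, merge joined on every column the two frames share (here, account number and name). Being explicit protects you if the frames ever gain another overlapping column:

all_data_st = pd.merge(all_data, status, on=["account number", "name"], how="left")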
This looks pretty good but let’s look at a specific account.
all_data_st[all_data_st["account number"]==737550].head()
| | account number | name | sku | quantity | unit price | ext price | date | status |
---|---|---|---|---|---|---|---|---|
15 | 737550 | Fritsch, Russel and Anderson | S1-47412 | 40 | 51.01 | 2040.40 | 2014-02-05 01:20:40 | NaN |
25 | 737550 | Fritsch, Russel and Anderson | S1-06532 | 34 | 18.69 | 635.46 | 2014-02-07 09:22:02 | NaN |
66 | 737550 | Fritsch, Russel and Anderson | S1-27722 | 15 | 70.23 | 1053.45 | 2014-02-16 18:24:42 | NaN |
78 | 737550 | Fritsch, Russel and Anderson | S2-34077 | 26 | 93.35 | 2427.10 | 2014-02-20 18:45:43 | NaN |
80 | 737550 | Fritsch, Russel and Anderson | S1-93683 | 31 | 10.52 | 326.12 | 2014-02-21 13:55:45 | NaN |
This account number was not in our status file, so we have a bunch of NaNs. We can decide how we want to handle this situation. For this specific case, let’s label all missing accounts as bronze. Use the fillna function to easily accomplish this on the status column.
all_data_st['status'].fillna('bronze',inplace=True)
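One caveat: under copy-on-write (the default behavior from pandas 3.0), calling fillna with inplace=True on a single column no longer writes back to the parent DataFrame, so the assignment form is the safer habit:

all_data_st["status"] = all_data_st["status"].fillna("bronze")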
all_data_st.head()
| | account number | name | sku | quantity | unit price | ext price | date | status |
---|---|---|---|---|---|---|---|---|
0 | 383080 | Will LLC | B1-20000 | 7 | 33.69 | 235.83 | 2014-02-01 09:04:59 | bronze |
1 | 412290 | Jerde-Hilpert | S1-27722 | 11 | 21.12 | 232.32 | 2014-02-01 11:51:46 | bronze |
2 | 412290 | Jerde-Hilpert | B1-86481 | 3 | 35.99 | 107.97 | 2014-02-01 17:24:32 | bronze |
3 | 412290 | Jerde-Hilpert | B1-20000 | 23 | 78.90 | 1814.70 | 2014-02-01 19:56:48 | bronze |
4 | 672390 | Kuhn-Gusikowski | S1-06532 | 48 | 55.82 | 2679.36 | 2014-02-02 03:45:20 | silver |
Check the data just to make sure we’re all good.
all_data_st[all_data_st["account number"]==737550].head()
| | account number | name | sku | quantity | unit price | ext price | date | status |
---|---|---|---|---|---|---|---|---|
15 | 737550 | Fritsch, Russel and Anderson | S1-47412 | 40 | 51.01 | 2040.40 | 2014-02-05 01:20:40 | bronze |
25 | 737550 | Fritsch, Russel and Anderson | S1-06532 | 34 | 18.69 | 635.46 | 2014-02-07 09:22:02 | bronze |
66 | 737550 | Fritsch, Russel and Anderson | S1-27722 | 15 | 70.23 | 1053.45 | 2014-02-16 18:24:42 | bronze |
78 | 737550 | Fritsch, Russel and Anderson | S2-34077 | 26 | 93.35 | 2427.10 | 2014-02-20 18:45:43 | bronze |
80 | 737550 | Fritsch, Russel and Anderson | S1-93683 | 31 | 10.52 | 326.12 | 2014-02-21 13:55:45 | bronze |
Now we have all of the data, along with the status column filled in. We can do our normal data manipulations using the full suite of pandas capabilities.
Using Categories
One of the relatively new features in pandas is support for categorical data. From the pandas documentation:
“Categoricals are a pandas data type, which correspond to categorical variables in statistics: a variable, which can take on only a limited, and usually fixed, number of possible values (categories; levels in R). Examples are gender, social class, blood types, country affiliations, observation time or ratings via Likert scales.”
For our purposes, the status field is a good candidate for a category type.
You must make sure you have a recent version of pandas installed for this example to work.
pd.__version__
'0.22.0'
First, we typecast it to a category using astype.
all_data_st["status"] = all_data_st["status"].astype("category")
This doesn’t appear to change anything yet.
all_data_st.head()
| | account number | name | sku | quantity | unit price | ext price | date | status |
---|---|---|---|---|---|---|---|---|
0 | 383080 | Will LLC | B1-20000 | 7 | 33.69 | 235.83 | 2014-02-01 09:04:59 | bronze |
1 | 412290 | Jerde-Hilpert | S1-27722 | 11 | 21.12 | 232.32 | 2014-02-01 11:51:46 | bronze |
2 | 412290 | Jerde-Hilpert | B1-86481 | 3 | 35.99 | 107.97 | 2014-02-01 17:24:32 | bronze |
3 | 412290 | Jerde-Hilpert | B1-20000 | 23 | 78.90 | 1814.70 | 2014-02-01 19:56:48 | bronze |
4 | 672390 | Kuhn-Gusikowski | S1-06532 | 48 | 55.82 | 2679.36 | 2014-02-02 03:45:20 | silver |
But you can see that it is a new data type.
all_data_st.dtypes
account number int64
name object
sku object
quantity int64
unit price float64
ext price float64
date datetime64[ns]
status category
dtype: object
Categories get more interesting when you assign an order to them. Right now, if we call sort_values on the column, it will sort alphabetically.
all_data_st.sort_values(by=["status"]).head()
| | account number | name | sku | quantity | unit price | ext price | date | status |
---|---|---|---|---|---|---|---|---|
0 | 383080 | Will LLC | B1-20000 | 7 | 33.69 | 235.83 | 2014-02-01 09:04:59 | bronze |
196 | 218895 | Kulas Inc | S2-83881 | 41 | 78.27 | 3209.07 | 2014-01-20 09:37:58 | bronze |
197 | 383080 | Will LLC | B1-33364 | 26 | 90.19 | 2344.94 | 2014-01-20 09:39:59 | bronze |
198 | 604255 | Halvorson, Crona and Champlin | S2-11481 | 37 | 96.71 | 3578.27 | 2014-01-20 13:07:28 | bronze |
200 | 527099 | Sanford and Sons | B1-05914 | 18 | 64.32 | 1157.76 | 2014-01-20 21:40:58 | bronze |
We use set_categories to tell it the order we want to use for this category object. In this case, we use the Olympic medal ordering.
all_data_st["status"].cat.set_categories([ "gold","silver","bronze"],inplace=True)
Now, we can sort it so that gold shows on top.
all_data_st.sort_values(by=["status"]).head()
| | account number | name | sku | quantity | unit price | ext price | date | status |
---|---|---|---|---|---|---|---|---|
68 | 740150 | Barton LLC | B1-38851 | 17 | 81.22 | 1380.74 | 2014-02-17 17:12:16 | gold |
63 | 257198 | Cronin, Oberbrunner and Spencer | S1-27722 | 28 | 10.21 | 285.88 | 2014-02-15 17:27:44 | gold |
207 | 740150 | Barton LLC | B1-86481 | 20 | 30.41 | 608.20 | 2014-01-22 16:33:51 | gold |
61 | 740150 | Barton LLC | B1-20000 | 28 | 81.39 | 2278.92 | 2014-02-15 07:45:16 | gold |
60 | 239344 | Stokes LLC | S2-83881 | 30 | 43.00 | 1290.00 | 2014-02-15 02:13:23 | gold |
all_data_st["status"].describe()
count 384
unique 3
top bronze
freq 172
Name: status, dtype: object
For instance, you can take a quick look at how your top-tier customers are performing compared to the bottom tier. Use groupby to get the average of the values.
all_data_st.groupby(["status"])[["quantity", "unit price", "ext price"]].mean()
status | quantity | unit price | ext price
---|---|---|---
gold | 24.375000 | 53.723889 | 1351.944583 |
silver | 22.842857 | 57.272714 | 1320.032214 |
bronze | 25.616279 | 57.371163 | 1472.965930 |
Of course, you can run multiple aggregation functions on the data to get really useful information.
all_data_st.groupby(["status"])[["quantity", "unit price", "ext price"]].agg(["sum", "mean", "std"])
status | quantity sum | quantity mean | quantity std | unit price sum | unit price mean | unit price std | ext price sum | ext price mean | ext price std
---|---|---|---|---|---|---|---|---|---
gold | 1755 | 24.375000 | 14.575145 | 3868.12 | 53.723889 | 28.740080 | 97340.01 | 1351.944583 | 1182.657312 |
silver | 3198 | 22.842857 | 14.512843 | 8018.18 | 57.272714 | 26.556242 | 184804.51 | 1320.032214 | 1086.384051 |
bronze | 4406 | 25.616279 | 14.136071 | 9867.84 | 57.371163 | 26.857370 | 253350.14 | 1472.965930 | 1116.683843 |
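As an aside, if you are on pandas 0.25 or later, named aggregation produces the same kind of summary with flat, readable column names; a sketch:

all_data_st.groupby("status").agg(
    total_quantity=("quantity", "sum"),
    avg_unit_price=("unit price", "mean"),
    total_ext_price=("ext price", "sum"),
)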
So, what does this tell you? Well, the data is completely random, but my first observation is that we sell more units to our bronze customers than to our gold ones. Even when you look at the total dollar value associated with bronze vs. gold, it looks backwards.
Maybe we should look at how many bronze customers we have and see what is going on.
What I plan to do is filter out the unique accounts and see how many gold, silver and bronze customers there are.
I’m purposely stringing a lot of commands together, which is not necessarily best practice but does show how powerful pandas can be. Feel free to review my previous articles and play with this command yourself to understand what all these commands mean; a step-by-step equivalent follows the output below.
all_data_st.drop_duplicates(subset=["account number","name"]).iloc[:,[0,1,7]].groupby(["status"])["name"].count()
status
gold 4
silver 7
bronze 9
Name: name, dtype: int64
Ok. This makes a little more sense. We see that we have 9 bronze customers and only 4 gold customers. That is probably why the volumes are so skewed towards our bronze customers.
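For readability, here is the same chain unpacked into named steps; the iloc[:, [0, 1, 7]] above simply selects the account number, name and status columns:

# One row per unique customer
customers = all_data_st.drop_duplicates(subset=["account number", "name"])
# Keep the identifying columns plus the tier
customers = customers[["account number", "name", "status"]]
# Count customers in each tier
customers.groupby("status")["name"].count()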
all_data_st.head(4)
| | account number | name | sku | quantity | unit price | ext price | date | status |
---|---|---|---|---|---|---|---|---|
0 | 383080 | Will LLC | B1-20000 | 7 | 33.69 | 235.83 | 2014-02-01 09:04:59 | bronze |
1 | 412290 | Jerde-Hilpert | S1-27722 | 11 | 21.12 | 232.32 | 2014-02-01 11:51:46 | bronze |
2 | 412290 | Jerde-Hilpert | B1-86481 | 3 | 35.99 | 107.97 | 2014-02-01 17:24:32 | bronze |
3 | 412290 | Jerde-Hilpert | B1-20000 | 23 | 78.90 | 1814.70 | 2014-02-01 19:56:48 | bronze |
We will start over and import a clean file
df = pd.read_excel("data/sample-sales-reps.xlsx")
## Set default commission of 3%
df["commission"] = .03
df.head()
| | account number | customer name | sales rep | sku | category | quantity | unit price | ext price | date | commission |
---|---|---|---|---|---|---|---|---|---|---|
0 | 680916 | Mueller and Sons | Loring Predovic | GP-14407 | Belt | 19 | 88.49 | 1681.31 | 2015-11-17 05:58:34 | 0.03 |
1 | 680916 | Mueller and Sons | Loring Predovic | FI-01804 | Shirt | 3 | 78.07 | 234.21 | 2016-02-13 04:04:11 | 0.03 |
2 | 530925 | Purdy and Sons | Teagan O'Keefe | EO-54210 | Shirt | 19 | 30.21 | 573.99 | 2015-08-11 12:44:38 | 0.03 |
3 | 14406 | Harber, Lubowitz and Fahey | Esequiel Schinner | NZ-99565 | Shirt | 12 | 90.29 | 1083.48 | 2016-01-23 02:15:50 | 0.03 |
4 | 398620 | Brekke Ltd | Esequiel Schinner | NZ-99565 | Shirt | 5 | 72.64 | 363.20 | 2015-08-10 07:16:03 | 0.03 |
Since shirts are high margin, adjust all products in the shirt category to a commission rate of 5%.
df.loc[df["category"] == "Shirt", ["commission"]] = .05
df.head()
| | account number | customer name | sales rep | sku | category | quantity | unit price | ext price | date | commission |
---|---|---|---|---|---|---|---|---|---|---|
0 | 680916 | Mueller and Sons | Loring Predovic | GP-14407 | Belt | 19 | 88.49 | 1681.31 | 2015-11-17 05:58:34 | 0.03 |
1 | 680916 | Mueller and Sons | Loring Predovic | FI-01804 | Shirt | 3 | 78.07 | 234.21 | 2016-02-13 04:04:11 | 0.05 |
2 | 530925 | Purdy and Sons | Teagan O'Keefe | EO-54210 | Shirt | 19 | 30.21 | 573.99 | 2015-08-11 12:44:38 | 0.05 |
3 | 14406 | Harber, Lubowitz and Fahey | Esequiel Schinner | NZ-99565 | Shirt | 12 | 90.29 | 1083.48 | 2016-01-23 02:15:50 | 0.05 |
4 | 398620 | Brekke Ltd | Esequiel Schinner | NZ-99565 | Shirt | 5 | 72.64 | 363.20 | 2015-08-10 07:16:03 | 0.05 |
There is also a special program for selling 10 or more belts in a transaction: those sales get a 4% commission rate.
df.loc[(df["category"] == "Belt") & (df["quantity"] >= 10), ["commission"]] = .04
df.head()
| | account number | customer name | sales rep | sku | category | quantity | unit price | ext price | date | commission |
---|---|---|---|---|---|---|---|---|---|---|
0 | 680916 | Mueller and Sons | Loring Predovic | GP-14407 | Belt | 19 | 88.49 | 1681.31 | 2015-11-17 05:58:34 | 0.04 |
1 | 680916 | Mueller and Sons | Loring Predovic | FI-01804 | Shirt | 3 | 78.07 | 234.21 | 2016-02-13 04:04:11 | 0.05 |
2 | 530925 | Purdy and Sons | Teagan O'Keefe | EO-54210 | Shirt | 19 | 30.21 | 573.99 | 2015-08-11 12:44:38 | 0.05 |
3 | 14406 | Harber, Lubowitz and Fahey | Esequiel Schinner | NZ-99565 | Shirt | 12 | 90.29 | 1083.48 | 2016-01-23 02:15:50 | 0.05 |
4 | 398620 | Brekke Ltd | Esequiel Schinner | NZ-99565 | Shirt | 5 | 72.64 | 363.20 | 2015-08-10 07:16:03 | 0.05 |
Finally, some transactions can get a bonus and a commission increase.
df["bonus"] = 0
df.loc[(df["category"] == "Shoes") & (df["ext price"] >= 1000 ), ["bonus", "commission"]] = 250, 0.045
Calculate the compensation at the line item level
df["comp"] = df["commission"] * df["ext price"] + df["bonus"]
df.head()
| | account number | customer name | sales rep | sku | category | quantity | unit price | ext price | date | commission | bonus | comp |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 680916 | Mueller and Sons | Loring Predovic | GP-14407 | Belt | 19 | 88.49 | 1681.31 | 2015-11-17 05:58:34 | 0.04 | 0 | 67.2524 |
1 | 680916 | Mueller and Sons | Loring Predovic | FI-01804 | Shirt | 3 | 78.07 | 234.21 | 2016-02-13 04:04:11 | 0.05 | 0 | 11.7105 |
2 | 530925 | Purdy and Sons | Teagan O'Keefe | EO-54210 | Shirt | 19 | 30.21 | 573.99 | 2015-08-11 12:44:38 | 0.05 | 0 | 28.6995 |
3 | 14406 | Harber, Lubowitz and Fahey | Esequiel Schinner | NZ-99565 | Shirt | 12 | 90.29 | 1083.48 | 2016-01-23 02:15:50 | 0.05 | 0 | 54.1740 |
4 | 398620 | Brekke Ltd | Esequiel Schinner | NZ-99565 | Shirt | 5 | 72.64 | 363.20 | 2015-08-10 07:16:03 | 0.05 | 0 | 18.1600 |
Calculate the commissions by sales rep
df.groupby(["sales rep"])["comp"].sum().round(2)
sales rep
Ansley Cummings 2699.69
Beth Skiles 3664.16
Esequiel Schinner 12841.28
Loring Predovic 13115.42
Shannen Hudson 6541.78
Teagan O'Keefe 10931.30
Trish Deckow 7641.91
Name: comp, dtype: float64
df["date"] = pd.to_datetime(df['date'])
df["month"] = df["date"].dt.month
Now, what about on a monthly basis?
df.head()
| | account number | customer name | sales rep | sku | category | quantity | unit price | ext price | date | commission | bonus | comp | month |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 680916 | Mueller and Sons | Loring Predovic | GP-14407 | Belt | 19 | 88.49 | 1681.31 | 2015-11-17 05:58:34 | 0.04 | 0 | 67.2524 | 11 |
1 | 680916 | Mueller and Sons | Loring Predovic | FI-01804 | Shirt | 3 | 78.07 | 234.21 | 2016-02-13 04:04:11 | 0.05 | 0 | 11.7105 | 2 |
2 | 530925 | Purdy and Sons | Teagan O'Keefe | EO-54210 | Shirt | 19 | 30.21 | 573.99 | 2015-08-11 12:44:38 | 0.05 | 0 | 28.6995 | 8 |
3 | 14406 | Harber, Lubowitz and Fahey | Esequiel Schinner | NZ-99565 | Shirt | 12 | 90.29 | 1083.48 | 2016-01-23 02:15:50 | 0.05 | 0 | 54.1740 | 1 |
4 | 398620 | Brekke Ltd | Esequiel Schinner | NZ-99565 | Shirt | 5 | 72.64 | 363.20 | 2015-08-10 07:16:03 | 0.05 | 0 | 18.1600 | 8 |
df.groupby(["month","sales rep"])["comp"].sum().round(2)
month sales rep
1 Ansley Cummings 157.10
Beth Skiles 78.09
Esequiel Schinner 1481.78
Loring Predovic 655.96
Shannen Hudson 319.13
Teagan O'Keefe 732.10
Trish Deckow 305.58
2 Ansley Cummings 347.06
Beth Skiles 1262.36
Esequiel Schinner 741.87
Loring Predovic 1794.22
Shannen Hudson 524.58
Teagan O'Keefe 893.54
Trish Deckow 430.81
3 Ansley Cummings 362.13
Beth Skiles 439.53
Esequiel Schinner 1323.87
Loring Predovic 1183.59
Shannen Hudson 474.90
Teagan O'Keefe 1064.76
Trish Deckow 796.12
4 Ansley Cummings 123.27
Beth Skiles 121.12
Esequiel Schinner 1478.78
Loring Predovic 907.41
Shannen Hudson 514.77
Teagan O'Keefe 593.64
Trish Deckow 347.04
5 Ansley Cummings 101.29
Beth Skiles 347.73
...
8 Teagan O'Keefe 477.33
Trish Deckow 1493.39
9 Ansley Cummings 109.70
Beth Skiles 145.43
Esequiel Schinner 178.20
Loring Predovic 978.46
Shannen Hudson 765.28
Teagan O'Keefe 1086.80
Trish Deckow 435.42
10 Ansley Cummings 510.49
Beth Skiles 204.40
Esequiel Schinner 994.76
Loring Predovic 1954.61
Shannen Hudson 990.48
Teagan O'Keefe 1111.47
Trish Deckow 625.12
11 Ansley Cummings 80.41
Beth Skiles 464.13
Esequiel Schinner 928.05
Loring Predovic 1253.52
Shannen Hudson 567.31
Teagan O'Keefe 554.03
Trish Deckow 281.69
12 Ansley Cummings 288.82
Beth Skiles 131.40
Esequiel Schinner 1235.69
Loring Predovic 660.44
Shannen Hudson 421.24
Teagan O'Keefe 421.16
Trish Deckow 848.57
Name: comp, Length: 84, dtype: float64
You can do this for the entire numerical DataFrame, without creating a month variable, by resampling on the date index as follows.
df.set_index('date').groupby('sales rep').resample("M").sum().head(20)
sales rep | date | account number | quantity | unit price | ext price | commission | bonus | comp | month
---|---|---|---|---|---|---|---|---|---
Ansley Cummings | 2015-06-30 | 3593984 | 59 | 196.43 | 2214.13 | 0.165 | 250 | 342.04250 | 24
Ansley Cummings | 2015-07-31 | 3593984 | 43 | 152.77 | 1460.69 | 0.160 | 0 | 58.43250 | 28
Ansley Cummings | 2015-08-31 | 7187968 | 77 | 437.26 | 4080.37 | 0.340 | 0 | 171.32280 | 64
Ansley Cummings | 2015-09-30 | 6289472 | 64 | 398.20 | 2691.38 | 0.280 | 0 | 109.69920 | 63
Ansley Cummings | 2015-10-31 | 6289472 | 86 | 437.41 | 5803.84 | 0.315 | 250 | 510.48870 | 70
Ansley Cummings | 2015-11-30 | 3593984 | 25 | 259.40 | 1715.97 | 0.160 | 0 | 80.40890 | 44
Ansley Cummings | 2015-12-31 | 9883456 | 139 | 465.39 | 6820.11 | 0.450 | 0 | 288.81630 | 132
Ansley Cummings | 2016-01-31 | 7187968 | 85 | 413.52 | 4346.59 | 0.310 | 0 | 157.10410 | 8
Ansley Cummings | 2016-02-29 | 4492480 | 56 | 233.69 | 2561.57 | 0.185 | 250 | 347.06005 | 10
Ansley Cummings | 2016-03-31 | 3593984 | 43 | 260.11 | 2553.24 | 0.175 | 250 | 362.13085 | 12
Ansley Cummings | 2016-04-30 | 4492480 | 55 | 276.44 | 2855.68 | 0.200 | 0 | 123.26960 | 20
Ansley Cummings | 2016-05-31 | 3593984 | 36 | 282.24 | 2151.62 | 0.190 | 0 | 101.28800 | 20
Ansley Cummings | 2016-06-30 | 3593984 | 15 | 257.26 | 980.25 | 0.180 | 0 | 47.62630 | 24
Beth Skiles | 2015-06-30 | 846366 | 37 | 161.44 | 1763.43 | 0.150 | 0 | 88.17150 | 18
Beth Skiles | 2015-07-31 | 1692732 | 64 | 235.55 | 2468.03 | 0.190 | 0 | 81.93400 | 42
Beth Skiles | 2015-08-31 | 3385464 | 123 | 699.43 | 6275.63 | 0.470 | 0 | 283.95210 | 96
Beth Skiles | 2015-09-30 | 1692732 | 68 | 326.76 | 3374.01 | 0.240 | 0 | 145.42650 | 54
Beth Skiles | 2015-10-31 | 3103342 | 106 | 591.17 | 5248.15 | 0.410 | 0 | 204.40310 | 110
Beth Skiles | 2015-11-30 | 1974854 | 78 | 412.76 | 4496.55 | 0.315 | 250 | 464.13370 | 77
Beth Skiles | 2015-12-31 | 2539098 | 55 | 480.51 | 2970.65 | 0.360 | 0 | 131.39730 | 108
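Note that sum ran over every numeric column, including account number and commission, where totals are meaningless. Selecting the columns you care about first keeps the output sensible; a sketch:

df.set_index("date").groupby("sales rep")[["ext price", "comp"]].resample("M").sum().head()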
What if you are only interested in weekly totals, with weeks ending on Monday?
df.set_index('date').groupby('sales rep').resample("W-Mon").sum().head(20)
sales rep | date | account number | quantity | unit price | ext price | commission | bonus | comp | month
---|---|---|---|---|---|---|---|---|---
Ansley Cummings | 2015-06-15 | 1796992 | 39 | 72.79 | 1436.45 | 0.075 | 250 | 309.1255 | 12
Ansley Cummings | 2015-06-22 | 898496 | 18 | 33.15 | 596.70 | 0.040 | 0 | 23.8680 | 6
Ansley Cummings | 2015-06-29 | 898496 | 2 | 90.49 | 180.98 | 0.050 | 0 | 9.0490 | 6
Ansley Cummings | 2015-07-06 | 898496 | 11 | 22.99 | 252.89 | 0.050 | 0 | 12.6445 | 7
Ansley Cummings | 2015-07-13 | 898496 | 7 | 83.34 | 583.38 | 0.030 | 0 | 17.5014 | 7
Ansley Cummings | 2015-07-20 | 898496 | 17 | 28.10 | 477.70 | 0.050 | 0 | 23.8850 | 7
Ansley Cummings | 2015-07-27 | 898496 | 8 | 18.34 | 146.72 | 0.030 | 0 | 4.4016 | 7
Ansley Cummings | 2015-08-03 | 0 | 0 | 0.00 | 0.00 | 0.000 | 0 | 0.0000 | 0
Ansley Cummings | 2015-08-10 | 2695488 | 36 | 112.58 | 1191.71 | 0.120 | 0 | 50.4873 | 24
Ansley Cummings | 2015-08-17 | 0 | 0 | 0.00 | 0.00 | 0.000 | 0 | 0.0000 | 0
Ansley Cummings | 2015-08-24 | 2695488 | 30 | 170.84 | 1964.79 | 0.140 | 0 | 89.4372 | 24
Ansley Cummings | 2015-08-31 | 1796992 | 11 | 153.84 | 923.87 | 0.080 | 0 | 31.3983 | 16
Ansley Cummings | 2015-09-07 | 0 | 0 | 0.00 | 0.00 | 0.000 | 0 | 0.0000 | 0
Ansley Cummings | 2015-09-14 | 2695488 | 47 | 89.22 | 1432.40 | 0.120 | 0 | 59.1218 | 27
Ansley Cummings | 2015-09-21 | 1796992 | 8 | 162.20 | 640.40 | 0.100 | 0 | 32.0200 | 18
Ansley Cummings | 2015-09-28 | 0 | 0 | 0.00 | 0.00 | 0.000 | 0 | 0.0000 | 0
Ansley Cummings | 2015-10-05 | 1796992 | 9 | 146.78 | 618.58 | 0.060 | 0 | 18.5574 | 18
Ansley Cummings | 2015-10-12 | 3593984 | 42 | 193.83 | 2216.34 | 0.170 | 0 | 87.6666 | 40
Ansley Cummings | 2015-10-19 | 0 | 0 | 0.00 | 0.00 | 0.000 | 0 | 0.0000 | 0
Ansley Cummings | 2015-10-26 | 2695488 | 44 | 243.58 | 3587.50 | 0.145 | 250 | 422.8221 | 30
To group by a calendar period directly, use pd.Grouper; for example, annual periods ending in December: df.groupby(['customer name', 'sku', pd.Grouper(key='date', freq='A-DEC')])['ext price'].sum()
Here is an aggregation pattern that is sometimes useful.
df.agg({'ext price': ['sum', 'mean'], 'quantity': ['sum', 'mean'], 'unit price': ['mean']})
| | ext price | quantity | unit price |
---|---|---|---|
mean | 571.75346 | 10.411333 | 55.316493 |
sum | 857630.19000 | 15617.000000 | NaN |
You can also create custom aggregation functions.
get_max = lambda x: x.value_counts(dropna=False).index[0]
get_max.__name__ = "most frequent" # required for row label
df.agg({'ext price': ['sum', 'mean'], 'quantity': ['sum', 'mean'], 'unit price': ['mean'], 'sku': [get_max]})
| | ext price | quantity | unit price | sku |
---|---|---|---|---|
mean | 571.75346 | 10.411333 | 55.316493 | NaN |
most frequent | NaN | NaN | NaN | TL-23025 |
sum | 857630.19000 | 15617.000000 | NaN | NaN |
It is nice to use an ordered dictionary so the aggregation rows come out in a predictable order (on Python 3.7+, a plain dict preserves insertion order too).
import collections
f = collections.OrderedDict([('ext price', ['sum', 'mean']), ('quantity', ['sum', 'mean']), ('sku', [get_max])])
df.agg(f)
| | ext price | quantity | sku |
---|---|---|---|
mean | 571.75346 | 10.411333 | NaN |
most frequent | NaN | NaN | TL-23025 |
sum | 857630.19000 | 15617.000000 | NaN |
You can, of course, use any of the available extensions, like qgrid below, for further analysis and filtering.
import qgrid
from IPython.display import display
qgrid_widget = qgrid.show_grid(df, show_toolbar=True)
## Not sure why this is not working - giving it a skip for now
qgrid_widget
QgridWidget(grid_options={'fullWidthRows': True, 'syncColumnCellResize': True, 'forceFitColumns': True, 'defau…
qgrid_widget.get_changed_df()
| | account number | customer name | sales rep | sku | category | quantity | unit price | ext price | date | commission | bonus | comp | month |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 680916 | Mueller and Sons | Loring Predovic | GP-14407 | Belt | 19 | 88.49 | 1681.31 | 2015-11-17 05:58:34 | 0.040 | 0 | 67.2524 | 11 |
1 | 680916 | Mueller and Sons | Loring Predovic | FI-01804 | Shirt | 3 | 78.07 | 234.21 | 2016-02-13 04:04:11 | 0.050 | 0 | 11.7105 | 2 |
2 | 530925 | Purdy and Sons | Teagan O'Keefe | EO-54210 | Shirt | 19 | 30.21 | 573.99 | 2015-08-11 12:44:38 | 0.050 | 0 | 28.6995 | 8 |
3 | 14406 | Harber, Lubowitz and Fahey | Esequiel Schinner | NZ-99565 | Shirt | 12 | 90.29 | 1083.48 | 2016-01-23 02:15:50 | 0.050 | 0 | 54.1740 | 1 |
4 | 398620 | Brekke Ltd | Esequiel Schinner | NZ-99565 | Shirt | 5 | 72.64 | 363.20 | 2015-08-10 07:16:03 | 0.050 | 0 | 18.1600 | 8 |
5 | 282122 | Connelly, Abshire and Von | Beth Skiles | GJ-90272 | Shoes | 20 | 96.62 | 1932.40 | 2016-03-17 10:19:05 | 0.045 | 250 | 336.9580 | 3 |
6 | 398620 | Brekke Ltd | Esequiel Schinner | DU-87462 | Shirt | 10 | 67.64 | 676.40 | 2015-11-25 22:05:36 | 0.050 | 0 | 33.8200 | 11 |
7 | 218667 | Jaskolski-O'Hara | Trish Deckow | DU-87462 | Shirt | 11 | 91.86 | 1010.46 | 2016-04-24 15:05:58 | 0.050 | 0 | 50.5230 | 4 |
8 | 398620 | Brekke Ltd | Esequiel Schinner | NZ-99565 | Shirt | 18 | 98.67 | 1776.06 | 2015-08-06 08:09:56 | 0.050 | 0 | 88.8030 | 8 |
9 | 14406 | Harber, Lubowitz and Fahey | Esequiel Schinner | GP-14407 | Belt | 12 | 64.48 | 773.76 | 2016-01-08 09:52:04 | 0.040 | 0 | 30.9504 | 1 |
10 | 887145 | Gislason LLC | Loring Predovic | NZ-99565 | Shirt | 20 | 92.87 | 1857.40 | 2016-05-22 06:09:58 | 0.050 | 0 | 92.8700 | 5 |
11 | 93583 | Hegmann-Howell | Esequiel Schinner | HZ-54995 | Belt | 17 | 40.56 | 689.52 | 2015-07-05 01:05:52 | 0.040 | 0 | 27.5808 | 7 |
12 | 280749 | Douglas PLC | Teagan O'Keefe | TK-29646 | Shoes | 17 | 42.61 | 724.37 | 2016-05-16 19:11:55 | 0.030 | 0 | 21.7311 | 5 |
13 | 453258 | Runolfsson-Bayer | Shannen Hudson | FI-01804 | Shirt | 20 | 12.31 | 246.20 | 2015-08-24 21:32:26 | 0.050 | 0 | 12.3100 | 8 |
14 | 453258 | Runolfsson-Bayer | Shannen Hudson | EO-54210 | Shirt | 20 | 67.95 | 1359.00 | 2015-11-24 15:04:30 | 0.050 | 0 | 67.9500 | 11 |
15 | 280749 | Douglas PLC | Teagan O'Keefe | GJ-90272 | Shoes | 17 | 23.20 | 394.40 | 2015-09-12 13:16:56 | 0.030 | 0 | 11.8320 | 9 |
16 | 453258 | Runolfsson-Bayer | Shannen Hudson | NZ-99565 | Shirt | 2 | 30.23 | 60.46 | 2016-03-21 11:31:24 | 0.050 | 0 | 3.0230 | 3 |
17 | 280749 | Douglas PLC | Teagan O'Keefe | GP-14407 | Belt | 3 | 22.64 | 67.92 | 2015-12-02 01:55:28 | 0.030 | 0 | 2.0376 | 12 |
18 | 398620 | Brekke Ltd | Esequiel Schinner | HZ-54995 | Belt | 1 | 36.20 | 36.20 | 2015-12-07 18:45:54 | 0.030 | 0 | 1.0860 | 12 |
19 | 251881 | Zulauf-Grady | Teagan O'Keefe | DU-87462 | Shirt | 9 | 82.06 | 738.54 | 2016-02-23 17:41:41 | 0.050 | 0 | 36.9270 | 2 |
20 | 575704 | Lindgren, Thompson and Kirlin | Teagan O'Keefe | DU-87462 | Shirt | 5 | 28.51 | 142.55 | 2016-03-01 19:34:33 | 0.050 | 0 | 7.1275 | 3 |
21 | 887145 | Gislason LLC | Loring Predovic | ZY-38455 | Shirt | 15 | 95.60 | 1434.00 | 2015-10-08 03:01:22 | 0.050 | 0 | 71.7000 | 10 |
22 | 575704 | Lindgren, Thompson and Kirlin | Teagan O'Keefe | TK-29646 | Shoes | 11 | 30.80 | 338.80 | 2015-11-25 06:07:47 | 0.030 | 0 | 10.1640 | 11 |
23 | 282122 | Connelly, Abshire and Von | Beth Skiles | HZ-54995 | Belt | 6 | 64.98 | 389.88 | 2016-01-15 21:07:30 | 0.030 | 0 | 11.6964 | 1 |
24 | 758030 | Kilback-Abernathy | Trish Deckow | TK-29646 | Shoes | 19 | 26.90 | 511.10 | 2015-09-28 14:33:34 | 0.030 | 0 | 15.3330 | 9 |
25 | 280749 | Douglas PLC | Teagan O'Keefe | NZ-99565 | Shirt | 15 | 30.34 | 455.10 | 2016-03-29 21:19:08 | 0.050 | 0 | 22.7550 | 3 |
26 | 680916 | Mueller and Sons | Loring Predovic | DU-87462 | Shirt | 2 | 61.01 | 122.02 | 2016-05-19 00:05:12 | 0.050 | 0 | 6.1010 | 5 |
27 | 898496 | Weissnat, Veum and Barton | Ansley Cummings | TL-23025 | Shoes | 8 | 18.34 | 146.72 | 2015-07-24 17:48:46 | 0.030 | 0 | 4.4016 | 7 |
28 | 530925 | Purdy and Sons | Teagan O'Keefe | DU-87462 | Shirt | 10 | 18.40 | 184.00 | 2016-03-05 23:45:57 | 0.050 | 0 | 9.2000 | 3 |
29 | 752312 | Watsica-Pfannerstill | Loring Predovic | GP-14407 | Belt | 15 | 17.93 | 268.95 | 2016-03-18 07:17:35 | 0.040 | 0 | 10.7580 | 3 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1470 | 680916 | Mueller and Sons | Loring Predovic | HZ-54995 | Belt | 18 | 17.93 | 322.74 | 2016-04-28 01:57:00 | 0.040 | 0 | 12.9096 | 4 |
1471 | 218667 | Jaskolski-O'Hara | Trish Deckow | TK-29646 | Shoes | 13 | 57.80 | 751.40 | 2015-11-09 12:17:52 | 0.030 | 0 | 22.5420 | 11 |
1472 | 887145 | Gislason LLC | Loring Predovic | GP-14407 | Belt | 5 | 51.82 | 259.10 | 2016-06-08 11:25:05 | 0.030 | 0 | 7.7730 | 6 |
1473 | 93583 | Hegmann-Howell | Esequiel Schinner | ZY-38455 | Shirt | 10 | 86.25 | 862.50 | 2016-05-10 23:48:57 | 0.050 | 0 | 43.1250 | 5 |
1474 | 530925 | Purdy and Sons | Teagan O'Keefe | FI-01804 | Shirt | 13 | 10.13 | 131.69 | 2016-04-01 20:39:41 | 0.050 | 0 | 6.5845 | 4 |
1475 | 752312 | Watsica-Pfannerstill | Loring Predovic | HZ-54995 | Belt | 18 | 23.48 | 422.64 | 2015-07-25 05:51:10 | 0.040 | 0 | 16.9056 | 7 |
1476 | 758030 | Kilback-Abernathy | Trish Deckow | TK-29646 | Shoes | 7 | 86.67 | 606.69 | 2015-08-19 15:35:31 | 0.030 | 0 | 18.2007 | 8 |
1477 | 398620 | Brekke Ltd | Esequiel Schinner | GP-14407 | Belt | 7 | 16.52 | 115.64 | 2016-04-07 10:25:42 | 0.030 | 0 | 3.4692 | 4 |
1478 | 218667 | Jaskolski-O'Hara | Trish Deckow | GJ-90272 | Shoes | 6 | 36.44 | 218.64 | 2015-11-02 20:55:11 | 0.030 | 0 | 6.5592 | 11 |
1479 | 14406 | Harber, Lubowitz and Fahey | Esequiel Schinner | HZ-54995 | Belt | 5 | 48.52 | 242.60 | 2016-05-22 12:34:35 | 0.030 | 0 | 7.2780 | 5 |
1480 | 218667 | Jaskolski-O'Hara | Trish Deckow | TL-23025 | Shoes | 14 | 88.33 | 1236.62 | 2015-07-12 15:59:56 | 0.045 | 250 | 305.6479 | 7 |
1481 | 530925 | Purdy and Sons | Teagan O'Keefe | HZ-54995 | Belt | 9 | 62.85 | 565.65 | 2015-08-12 17:07:20 | 0.030 | 0 | 16.9695 | 8 |
1482 | 575704 | Lindgren, Thompson and Kirlin | Teagan O'Keefe | GP-14407 | Belt | 9 | 55.57 | 500.13 | 2016-02-27 03:20:13 | 0.030 | 0 | 15.0039 | 2 |
1483 | 758030 | Kilback-Abernathy | Trish Deckow | GJ-90272 | Shoes | 8 | 98.87 | 790.96 | 2016-02-10 16:51:59 | 0.030 | 0 | 23.7288 | 2 |
1484 | 136521 | Labadie and Sons | Esequiel Schinner | FI-01804 | Shirt | 15 | 99.43 | 1491.45 | 2015-12-17 21:58:31 | 0.050 | 0 | 74.5725 | 12 |
1485 | 898496 | Weissnat, Veum and Barton | Ansley Cummings | GJ-90272 | Shoes | 18 | 11.05 | 198.90 | 2016-04-01 10:19:26 | 0.030 | 0 | 5.9670 | 4 |
1486 | 136521 | Labadie and Sons | Esequiel Schinner | HZ-54995 | Belt | 20 | 54.67 | 1093.40 | 2016-02-20 10:43:41 | 0.040 | 0 | 43.7360 | 2 |
1487 | 218667 | Jaskolski-O'Hara | Trish Deckow | TL-23025 | Shoes | 19 | 45.97 | 873.43 | 2015-08-12 22:53:38 | 0.030 | 0 | 26.2029 | 8 |
1488 | 575704 | Lindgren, Thompson and Kirlin | Teagan O'Keefe | TL-23025 | Shoes | 1 | 55.81 | 55.81 | 2016-02-01 17:50:13 | 0.030 | 0 | 1.6743 | 2 |
1489 | 887145 | Gislason LLC | Loring Predovic | ZY-38455 | Shirt | 10 | 16.37 | 163.70 | 2015-10-11 22:11:45 | 0.050 | 0 | 8.1850 | 10 |
1490 | 898496 | Weissnat, Veum and Barton | Ansley Cummings | GP-14407 | Belt | 8 | 83.37 | 666.96 | 2016-01-27 03:26:53 | 0.030 | 0 | 20.0088 | 1 |
1491 | 62592 | O'Keefe-Koch | Shannen Hudson | TK-29646 | Shoes | 18 | 24.92 | 448.56 | 2016-02-11 17:13:23 | 0.030 | 0 | 13.4568 | 2 |
1492 | 530925 | Purdy and Sons | Teagan O'Keefe | HZ-54995 | Belt | 3 | 90.33 | 270.99 | 2015-09-23 07:36:34 | 0.030 | 0 | 8.1297 | 9 |
1493 | 93583 | Hegmann-Howell | Esequiel Schinner | GP-14407 | Belt | 5 | 45.93 | 229.65 | 2016-04-13 22:34:56 | 0.030 | 0 | 6.8895 | 4 |
1494 | 14406 | Harber, Lubowitz and Fahey | Esequiel Schinner | ZY-38455 | Shirt | 16 | 21.60 | 345.60 | 2015-11-18 06:28:56 | 0.050 | 0 | 17.2800 | 11 |
1495 | 136521 | Labadie and Sons | Esequiel Schinner | GP-14407 | Belt | 4 | 98.57 | 394.28 | 2016-06-01 17:28:44 | 0.030 | 0 | 11.8284 | 6 |
1496 | 575704 | Lindgren, Thompson and Kirlin | Teagan O'Keefe | TK-29646 | Shoes | 3 | 65.16 | 195.48 | 2016-04-02 16:38:31 | 0.030 | 0 | 5.8644 | 4 |
1497 | 898496 | Weissnat, Veum and Barton | Ansley Cummings | EO-54210 | Shirt | 17 | 28.10 | 477.70 | 2015-07-20 19:30:10 | 0.050 | 0 | 23.8850 | 7 |
1498 | 62592 | O'Keefe-Koch | Shannen Hudson | FI-01804 | Shirt | 19 | 94.96 | 1804.24 | 2015-10-05 15:55:01 | 0.050 | 0 | 90.2120 | 10 |
1499 | 530925 | Purdy and Sons | Teagan O'Keefe | DU-87462 | Shirt | 3 | 22.86 | 68.58 | 2015-10-07 19:49:38 | 0.050 | 0 | 3.4290 | 10 |
1500 rows × 13 columns
df.head()
| | account number | customer name | sales rep | sku | category | quantity | unit price | ext price | date | commission | bonus | comp | month |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 680916 | Mueller and Sons | Loring Predovic | GP-14407 | Belt | 19 | 88.49 | 1681.31 | 2015-11-17 05:58:34 | 0.04 | 0 | 67.2524 | 11 |
1 | 680916 | Mueller and Sons | Loring Predovic | FI-01804 | Shirt | 3 | 78.07 | 234.21 | 2016-02-13 04:04:11 | 0.05 | 0 | 11.7105 | 2 |
2 | 530925 | Purdy and Sons | Teagan O'Keefe | EO-54210 | Shirt | 19 | 30.21 | 573.99 | 2015-08-11 12:44:38 | 0.05 | 0 | 28.6995 | 8 |
3 | 14406 | Harber, Lubowitz and Fahey | Esequiel Schinner | NZ-99565 | Shirt | 12 | 90.29 | 1083.48 | 2016-01-23 02:15:50 | 0.05 | 0 | 54.1740 | 1 |
4 | 398620 | Brekke Ltd | Esequiel Schinner | NZ-99565 | Shirt | 5 | 72.64 | 363.20 | 2015-08-10 07:16:03 | 0.05 | 0 | 18.1600 | 8 |
df.groupby("category").agg({"quantity":["count","size"],"unit price":["sum"],"ext price":['mean']})
category | quantity count | quantity size | unit price sum | ext price mean
---|---|---|---|---
Belt | 298 | 298 | 15754.05 | 548.017852 |
Shirt | 734 | 734 | 41696.16 | 581.485817 |
Shoes | 468 | 468 | 25524.53 | 571.603162 |
What is nice about pivot_table is that you have three DataFrame parameters to adjust: index, columns, and values, whereas groupby only uses index and values.
df.pivot_table(index=["month", "sales rep"], columns=["category"], values=["bonus", "comp"], aggfunc=[np.sum, np.size], fill_value="No Sale").head(12)
month | sales rep | sum bonus Belt | sum bonus Shirt | sum bonus Shoes | sum comp Belt | sum comp Shirt | sum comp Shoes | size bonus Belt | size bonus Shirt | size bonus Shoes | size comp Belt | size comp Shirt | size comp Shoes
---|---|---|---|---|---|---|---|---|---|---|---|---|---
1 | Ansley Cummings | 0 | 0 | 0 | 84.7025 | 34.094 | 38.3076 | 3 | 3 | 2 | 3 | 3 | 2
1 | Beth Skiles | 0 | 0 | 0 | 24.4992 | 35.6795 | 17.9136 | 3 | 2 | 1 | 3 | 2 | 1
1 | Esequiel Schinner | 0 | 0 | 750 | 120.595 | 348.077 | 1013.11 | 3 | 13 | 8 | 3 | 13 | 8
1 | Loring Predovic | 0 | 0 | 250 | 37.9589 | 252.68 | 365.321 | 6 | 8 | 4 | 6 | 8 | 4
1 | Shannen Hudson | 0 | 0 | 0 | 116.422 | 201.026 | 1.6809 | 3 | 6 | 1 | 3 | 6 | 1
1 | Teagan O'Keefe | 0 | 0 | 0 | 42.5652 | 660.813 | 28.7259 | 2 | 21 | 3 | 2 | 21 | 3
1 | Trish Deckow | 0 | 0 | 0 | 113.095 | 172.417 | 20.0688 | 7 | 8 | 3 | 7 | 8 | 3
2 | Ansley Cummings | No Sale | 0 | 250 | No Sale | 10.0695 | 336.991 | No Sale | 1 | 4 | No Sale | 1 | 4
2 | Beth Skiles | No Sale | 0 | 1000 | No Sale | 17.809 | 1244.55 | No Sale | 2 | 5 | No Sale | 2 | 5
2 | Esequiel Schinner | 0 | 0 | 250 | 153.831 | 174.106 | 413.933 | 6 | 8 | 9 | 6 | 8 | 9
2 | Loring Predovic | 0 | 0 | 1000 | 82.4416 | 366.779 | 1345 | 4 | 11 | 10 | 4 | 11 | 10
2 | Shannen Hudson | 0 | 0 | 250 | 38.8498 | 152.086 | 333.645 | 3 | 4 | 3 | 3 | 4 | 3
In the pivot table below, only certain calculations are applied to certain columns; it is more selective.
# Without the brackets around the functions, the type of calculation is not displayed:
#df.pivot_table(index=["month", "sales rep"], columns=["category"], values=["comp", "bonus"], aggfunc={"comp": np.sum, "bonus": np.size}, fill_value="No Sale").head(12)
df.pivot_table(index=["month", "sales rep"], columns=["category"], values=["comp", "bonus"], aggfunc={"comp": [np.sum], "bonus": [np.size]}, fill_value="No Sale").head(12)
month | sales rep | bonus size Belt | bonus size Shirt | bonus size Shoes | comp sum Belt | comp sum Shirt | comp sum Shoes
---|---|---|---|---|---|---|---
1 | Ansley Cummings | 3 | 3 | 2 | 84.7025 | 34.094 | 38.3076
1 | Beth Skiles | 3 | 2 | 1 | 24.4992 | 35.6795 | 17.9136
1 | Esequiel Schinner | 3 | 13 | 8 | 120.595 | 348.077 | 1013.11
1 | Loring Predovic | 6 | 8 | 4 | 37.9589 | 252.68 | 365.321
1 | Shannen Hudson | 3 | 6 | 1 | 116.422 | 201.026 | 1.6809
1 | Teagan O'Keefe | 2 | 21 | 3 | 42.5652 | 660.813 | 28.7259
1 | Trish Deckow | 7 | 8 | 3 | 113.095 | 172.417 | 20.0688
2 | Ansley Cummings | No Sale | 1 | 4 | No Sale | 10.0695 | 336.991
2 | Beth Skiles | No Sale | 2 | 5 | No Sale | 17.809 | 1244.55
2 | Esequiel Schinner | 6 | 8 | 9 | 153.831 | 174.106 | 413.933
2 | Loring Predovic | 4 | 11 | 10 | 82.4416 | 366.779 | 1345
2 | Shannen Hudson | 3 | 4 | 3 | 38.8498 | 152.086 | 333.645
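pivot_table can also append subtotal rows and columns with margins=True (labeled All), which is handy as an at-a-glance sanity check; a sketch:

df.pivot_table(index="sales rep", columns="category", values="comp", aggfunc="sum", margins=True).round(2)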
df_pivot = df.pivot_table(index=["month", "sales rep"], columns=["category"], values=["comp", "bonus"], aggfunc={"comp": [np.sum], "bonus": [np.size]}, fill_value="No Sale")
You can now, if you feel the need, do some querying.
df_pivot.query("month == [1]")
month | sales rep | bonus size Belt | bonus size Shirt | bonus size Shoes | comp sum Belt | comp sum Shirt | comp sum Shoes
---|---|---|---|---|---|---|---
1 | Ansley Cummings | 3 | 3 | 2 | 84.7025 | 34.094 | 38.3076
1 | Beth Skiles | 3 | 2 | 1 | 24.4992 | 35.6795 | 17.9136
1 | Esequiel Schinner | 3 | 13 | 8 | 120.595 | 348.077 | 1013.11
1 | Loring Predovic | 6 | 8 | 4 | 37.9589 | 252.68 | 365.321
1 | Shannen Hudson | 3 | 6 | 1 | 116.422 | 201.026 | 1.6809
1 | Teagan O'Keefe | 2 | 21 | 3 | 42.5652 | 660.813 | 28.7259
1 | Trish Deckow | 7 | 8 | 3 | 113.095 | 172.417 | 20.0688
Here is another way to do it, which I find more reliable.
df_pivot[df_pivot.index.get_level_values(0).isin([1])]
month | sales rep | bonus size Belt | bonus size Shirt | bonus size Shoes | comp sum Belt | comp sum Shirt | comp sum Shoes
---|---|---|---|---|---|---|---
1 | Ansley Cummings | 3 | 3 | 2 | 84.7025 | 34.094 | 38.3076
1 | Beth Skiles | 3 | 2 | 1 | 24.4992 | 35.6795 | 17.9136
1 | Esequiel Schinner | 3 | 13 | 8 | 120.595 | 348.077 | 1013.11
1 | Loring Predovic | 6 | 8 | 4 | 37.9589 | 252.68 | 365.321
1 | Shannen Hudson | 3 | 6 | 1 | 116.422 | 201.026 | 1.6809
1 | Teagan O'Keefe | 2 | 21 | 3 | 42.5652 | 660.813 | 28.7259
1 | Trish Deckow | 7 | 8 | 3 | 113.095 | 172.417 | 20.0688
df_pivot[df_pivot.index.get_level_values(1).isin(["Ansley Cummings"])]
month | sales rep | bonus size Belt | bonus size Shirt | bonus size Shoes | comp sum Belt | comp sum Shirt | comp sum Shoes
---|---|---|---|---|---|---|---
1 | Ansley Cummings | 3 | 3 | 2 | 84.7025 | 34.094 | 38.3076 |
2 | Ansley Cummings | No Sale | 1 | 4 | No Sale | 10.0695 | 336.991 |
3 | Ansley Cummings | No Sale | 2 | 2 | No Sale | 48.57 | 313.561 |
4 | Ansley Cummings | 1 | 2 | 2 | 50.112 | 62.678 | 10.4796 |
5 | Ansley Cummings | 1 | 3 | No Sale | 25.172 | 76.116 | No Sale |
6 | Ansley Cummings | 2 | 4 | 2 | 25.9473 | 54.596 | 309.125 |
7 | Ansley Cummings | No Sale | 2 | 2 | No Sale | 36.5295 | 21.903 |
8 | Ansley Cummings | 2 | 4 | 2 | 51.198 | 90.2805 | 29.8443 |
9 | Ansley Cummings | 2 | 3 | 2 | 39.3054 | 49.317 | 21.0768 |
10 | Ansley Cummings | 1 | 4 | 2 | 20.1152 | 154.215 | 336.159 |
11 | Ansley Cummings | 2 | 2 | No Sale | 8.0844 | 72.3245 | No Sale |
12 | Ansley Cummings | 2 | 5 | 4 | 73.1176 | 164.834 | 50.8647 |
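A cross-section with xs does the same single-rep lookup more directly, selecting on the sales rep level of the index:

df_pivot.xs("Ansley Cummings", level="sales rep")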
df.head()
| | account number | customer name | sales rep | sku | category | quantity | unit price | ext price | date | commission | bonus | comp | month |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 680916 | Mueller and Sons | Loring Predovic | GP-14407 | Belt | 19 | 88.49 | 1681.31 | 2015-11-17 05:58:34 | 0.04 | 0 | 67.2524 | 11 |
1 | 680916 | Mueller and Sons | Loring Predovic | FI-01804 | Shirt | 3 | 78.07 | 234.21 | 2016-02-13 04:04:11 | 0.05 | 0 | 11.7105 | 2 |
2 | 530925 | Purdy and Sons | Teagan O'Keefe | EO-54210 | Shirt | 19 | 30.21 | 573.99 | 2015-08-11 12:44:38 | 0.05 | 0 | 28.6995 | 8 |
3 | 14406 | Harber, Lubowitz and Fahey | Esequiel Schinner | NZ-99565 | Shirt | 12 | 90.29 | 1083.48 | 2016-01-23 02:15:50 | 0.05 | 0 | 54.1740 | 1 |
4 | 398620 | Brekke Ltd | Esequiel Schinner | NZ-99565 | Shirt | 5 | 72.64 | 363.20 | 2015-08-10 07:16:03 | 0.05 | 0 | 18.1600 | 8 |